Events listing, deletion and archival optimizations #11443
winterhazel wants to merge 7 commits into apache:main
Conversation
@blueorangutan package

@winterhazel a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##              main   #11443      +/-  ##
============================================
+ Coverage    18.01%   18.08%    +0.06%
- Complexity   16607    16703       +96
============================================
  Files         6029     6036        +7
  Lines       542160   542433      +273
  Branches     66451    66421       -30
============================================
+ Hits         97682    98100      +418
+ Misses      433461   433313     -148
- Partials     11017    11020       +3
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 14617
Pull Request Overview
This PR optimizes the performance of events listing and deletion operations to reduce response times in environments with large numbers of events. The optimizations include removing unnecessary DISTINCT operations, adding a covering index for better query performance, and implementing batch deletion.
- Replaced DISTINCT with NATIVE function in event listing queries for better performance
- Implemented batch deletion using the existing `delete.query.batch.size` configuration
- Added a new covering index on the event table to optimize search queries
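The batch-deletion pattern summarized above can be sketched as follows. This is an illustrative stand-in, not CloudStack's actual DAO code: `batchExpunge` and the `deleteBatch` callback are hypothetical names, and the callback represents a `DELETE ... LIMIT ?` issued per iteration.

```java
import java.util.function.LongUnaryOperator;

public class BatchDeleteSketch {
    /**
     * Repeatedly deletes up to batchSize matching rows until a batch comes back
     * short, mirroring the "delete in batches" loop this PR describes.
     * deleteBatch stands in for a DAO call (e.g. a DELETE ... LIMIT ?) and
     * returns the number of rows it removed. A batchSize <= 0 means "no
     * batching": issue one unbounded delete.
     */
    public static long batchExpunge(LongUnaryOperator deleteBatch, long batchSize) {
        if (batchSize <= 0) {
            return deleteBatch.applyAsLong(0); // single unbounded delete
        }
        long total = 0;
        long removed;
        do {
            removed = deleteBatch.applyAsLong(batchSize);
            total += removed;
        } while (removed >= batchSize); // a short batch means nothing is left
        return total;
    }

    public static void main(String[] args) {
        // Simulate a table with 2,500 matching rows and a batch size of 1,000:
        // the loop runs three times (1000 + 1000 + 500).
        long[] remaining = {2_500};
        long total = batchExpunge(limit -> {
            long n = limit <= 0 ? remaining[0] : Math.min(limit, remaining[0]);
            remaining[0] -= n;
            return n;
        }, 1_000);
        System.out.println(total); // 2500
    }
}
```

Bounding each statement this way keeps transactions short and avoids one huge `DELETE` holding locks on a multi-million-row event table.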
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| ManagementServerImpl.java | Refactored deleteEvents method to use batch deletion and simplified access control logic |
| ConfigurationManagerImpl.java | Updated documentation for DELETE_QUERY_BATCH_SIZE to include events |
| QueryManagerImpl.java | Removed DISTINCT operation from event search query for performance |
| EventDaoImpl.java | Added new purgeAll method for batch deletion and removed unused listOlderEvents method |
| EventDao.java | Added purgeAll method interface and removed listOlderEvents method |
| DomainDaoImpl.java | Added getDomainAndChildrenIds convenience method |
| DomainDao.java | Added getDomainAndChildrenIds method interface |
| DeleteEventsCmd.java | Moved parameter validation from command to service layer and improved error message |
| ListEventsCmd.java | Removed unnecessary blank line |
| Upgrade42010to42020.java | Added database upgrade class to create covering index |
| DbUpgradeUtils.java | Added utility method for creating indexes with custom names |
| schema-42010to42020.sql | Empty schema upgrade file |
| schema-42010to42020-cleanup.sql | Empty schema cleanup file |
@blueorangutan package

@winterhazel a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 14637
shwstppr left a comment
code looks good, will make the delete events action much better. Needs testing.
@winterhazel could we extend this to cover archive events as well? Also, I’m not aware of any other DB changes slated for 4.20.2. If this is the only one, would it be safer to target 4.22? Not blocking—just a suggestion.
@shwstppr I'll have a look into whether event archiving needs or would benefit from any changes. About targeting 4.22 instead: the only DB change here is the addition of an index. This cannot cause any inconsistency as far as I know, so it should be safe going into 4.20.2.
Moving to 4.22 milestone as it has some DB changes
Force-pushed 0a8e293 to 73f79b8 (compare)
Force-pushed 73f79b8 to f52a619 (compare)
[SF] Trillian test result (tid-15078)

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@blueorangutan package

@winterhazel a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress.

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 16622

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@blueorangutan package

@winterhazel a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress.

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17555

@blueorangutan test

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.
```diff
+            long totalArchived = _eventDao.archiveEvents(ids, type, startDate, endDate, accountId, domainIds,
+                    ConfigurationManagerImpl.DELETE_QUERY_BATCH_SIZE.value());
-            if (ids != null && events.size() < ids.size()) {
-                return false;
-            }
-            _eventDao.archiveEvents(events);
-            return result;
+            return totalArchived > 0;
```
When ids is provided, this method reports success if any matching rows were archived. That means a request like ids=[1,2,3] can return success even if only a subset was actually archived (e.g., some IDs don’t exist or aren’t in the caller’s scope). If the API is expected to be all-or-nothing for explicit IDs, consider returning failure (or throwing) when totalArchived != ids.size() when ids is non-empty, or otherwise make the partial-success behavior explicit.
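The all-or-nothing semantics the reviewer describes could be expressed with a small check like the one below. This is a hypothetical helper for illustration only, not code from the PR:

```java
import java.util.List;

public class ArchiveResultCheck {
    /**
     * When explicit ids are requested, treat anything short of archiving all of
     * them as failure instead of reporting partial success; for criteria-based
     * requests (no ids), any archived row counts as success.
     */
    public static boolean allOrNothing(List<Long> ids, long totalArchived) {
        if (ids != null && !ids.isEmpty()) {
            return totalArchived == ids.size(); // every requested id must have matched
        }
        return totalArchived > 0;
    }
}
```

With this check, `archiveEvents(ids=[1,2,3])` archiving only two rows would report failure rather than silently succeeding.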
This was the previous behavior. I do not want to change the response of the API in this patch, just optimize its operations. Edit: there was a change. I will adjust.
```diff
     @Override
-    public void archiveEvents(List<EventVO> events) {
-        if (events != null && !events.isEmpty()) {
-            TransactionLegacy txn = TransactionLegacy.currentTxn();
-            txn.start();
-            for (EventVO event : events) {
-                event = lockRow(event.getId(), true);
-                event.setArchived(true);
-                update(event.getId(), event);
-            }
-            txn.commit();
-            txn.close();
-        }
-    }
+    public long archiveEvents(List<Long> ids, String type, Date startDate, Date endDate, Long accountId, List<Long> domainIds,
+            long limitPerQuery) {
+        SearchCriteria<EventVO> sc = createEventSearchCriteria(ids, type, startDate, endDate, null, accountId, domainIds);
+        Filter filter = null;
+        if (limitPerQuery > 0) {
+            filter = new Filter(limitPerQuery);
+        }
+
+        long archived;
+        long totalArchived = 0L;
+
+        do {
+            List<EventVO> events = search(sc, filter);
+            if (events.isEmpty()) {
+                break;
+            }
+
+            archived = archiveEventsInternal(events);
+            totalArchived += archived;
+        } while (limitPerQuery > 0 && archived >= limitPerQuery);
```
archiveEvents can load all matching events into memory when limitPerQuery <= 0 (the default for delete.query.batch.size is 0), then builds a single UPDATE ... WHERE id IN (...) statement containing every ID. On large event tables this risks OOMs, oversized SQL statements, and very long-running queries. Consider always archiving in batches (even when the config is 0), or implement a batched UPDATE ... LIMIT <batch> loop similar to batchExpunge so the operation remains bounded.
Ignoring the configuration and doing it in batches anyway would make having a configuration weird. I think that we can just enable batch operations by default instead.
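The author's suggestion to enable batch operations by default could be as simple as falling back to a bounded batch size when the setting is 0. A minimal sketch, with a hypothetical default value (not a CloudStack constant):

```java
public class BatchSizeDefaults {
    /**
     * If the operator leaves delete.query.batch.size at 0 (historically meaning
     * "unbounded"), fall back to a bounded default so archive/delete loops
     * never load or touch the whole event table in one statement.
     * DEFAULT_BATCH_SIZE is an illustrative value, not the real default.
     */
    static final long DEFAULT_BATCH_SIZE = 10_000L;

    public static long effectiveBatchSize(long configured) {
        return configured > 0 ? configured : DEFAULT_BATCH_SIZE;
    }
}
```

This keeps the configuration meaningful (operators can still tune the batch size) while making the bounded behavior the out-of-the-box default.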
```diff
     @Override
-    public void archiveEvents(List<EventVO> events) {
-        if (events != null && !events.isEmpty()) {
-            TransactionLegacy txn = TransactionLegacy.currentTxn();
-            txn.start();
-            for (EventVO event : events) {
-                event = lockRow(event.getId(), true);
-                event.setArchived(true);
-                update(event.getId(), event);
-            }
-            txn.commit();
-            txn.close();
-        }
-    }
+    public long archiveEvents(List<Long> ids, String type, Date startDate, Date endDate, Long accountId, List<Long> domainIds,
+            long limitPerQuery) {
+        SearchCriteria<EventVO> sc = createEventSearchCriteria(ids, type, startDate, endDate, null, accountId, domainIds);
+        Filter filter = null;
+        if (limitPerQuery > 0) {
+            filter = new Filter(limitPerQuery);
+        }
+
+        long archived;
+        long totalArchived = 0L;
+
+        do {
+            List<EventVO> events = search(sc, filter);
+            if (events.isEmpty()) {
+                break;
+            }
+
+            archived = archiveEventsInternal(events);
+            totalArchived += archived;
+        } while (limitPerQuery > 0 && archived >= limitPerQuery);
+
+        return totalArchived;
+    }
+
+    @DB
+    private long archiveEventsInternal(List<EventVO> events) {
+        final String idsAsString = events.stream()
+                .map(e -> Long.toString(e.getId()))
+                .collect(Collectors.joining(","));
+        final String query = String.format("UPDATE event SET archived=true WHERE id IN (%s)", idsAsString);
+
+        try (TransactionLegacy txn = TransactionLegacy.currentTxn();
+                PreparedStatement pstmt = txn.prepareStatement(query)) {
+            return pstmt.executeUpdate();
+        } catch (SQLException e) {
+            throw new CloudRuntimeException(e);
+        }
+    }
+
+    @Override
+    public long purgeAll(List<Long> ids, Date startDate, Date endDate, Date limitDate, String type, Long accountId,
+            List<Long> domainIds, long limitPerQuery) {
+        SearchCriteria<EventVO> sc = createEventSearchCriteria(ids, type, startDate, endDate, limitDate, accountId, domainIds);
+        return batchExpunge(sc, limitPerQuery);
+    }
```
New batching semantics in archiveEvents(...)/purgeAll(...) are non-trivial (limit handling, iteration/termination, criteria composition) but there are currently no unit tests covering these paths (unlike other DAO batch-expunge usages in engine/schema/src/test). Consider adding a focused DAO unit test validating that multiple batches are processed correctly and that limitPerQuery <= 0 behaves as intended.
```diff
             _accountMgr.checkAccess(CallContext.current().getCallingAccount(), null, false, sameOwnerEvents);
+            long totalRemoved = _eventDao.purgeAll(ids, startDate, endDate, null, type, accountId, domainIds,
+                    ConfigurationManagerImpl.DELETE_QUERY_BATCH_SIZE.value());
```
When ids is provided, this method reports success if any rows were deleted. This can silently ignore IDs that don’t match the criteria / caller scope while still returning success, which is potentially misleading for callers expecting all requested IDs to be removed. Consider aligning the return semantics with the requested operation (e.g., fail when ids is non-empty and totalRemoved != ids.size(), or explicitly document/return partial results).
Suggested change:

```java
if (CollectionUtils.isNotEmpty(ids)) {
    return totalRemoved == ids.size();
}
```
```java
final String idsAsString = events.stream()
        .map(e -> Long.toString(e.getId()))
        .collect(Collectors.joining(","));
final String query = String.format("UPDATE event SET archived=true WHERE id IN (%s)", idsAsString);

try (TransactionLegacy txn = TransactionLegacy.currentTxn();
        PreparedStatement pstmt = txn.prepareStatement(query)) {
```
archiveEventsInternal builds SQL by string-concatenating the IDs into an IN (...) clause. Besides the query-size risk, it bypasses parameter binding and can stress SQL parsing/plan caching. Consider using a parameterized statement (placeholders) or a different batching strategy (e.g., update by criteria with LIMIT) to keep statements small and reusable.
Suggested change:

```diff
-final String idsAsString = events.stream()
-        .map(e -> Long.toString(e.getId()))
-        .collect(Collectors.joining(","));
-final String query = String.format("UPDATE event SET archived=true WHERE id IN (%s)", idsAsString);
-try (TransactionLegacy txn = TransactionLegacy.currentTxn();
-        PreparedStatement pstmt = txn.prepareStatement(query)) {
+if (CollectionUtils.isEmpty(events)) {
+    return 0L;
+}
+final String placeholders = events.stream()
+        .map(event -> "?")
+        .collect(Collectors.joining(","));
+final String query = String.format("UPDATE event SET archived=true WHERE id IN (%s)", placeholders);
+try (TransactionLegacy txn = TransactionLegacy.currentTxn();
+        PreparedStatement pstmt = txn.prepareStatement(query)) {
+    for (int i = 0; i < events.size(); i++) {
+        pstmt.setLong(i + 1, events.get(i).getId());
+    }
```
This suggestion does not make sense to me
[SF] Trillian test result (tid-15961)

DaanHoogland left a comment

code looks good, two questions:
- can we expect / catch any runtime exceptions in the command objects (ArchiveEvents and DeleteEvents)?
- will further testing (beyond unit testing) be needed?
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@DaanHoogland Some basic manual testing would be nice.
@blueorangutan test keepEnv |
|
@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-15998)
|
Description

The `listEvents` and `deleteEvents` APIs execute some non-optimized queries, which results in very long response times in environments that have many events.

Some optimizations were performed in these two workflows in order to reduce response times. The listing now uses a covering index and no longer performs an unnecessary `DISTINCT`. The deletion is now performed in batches; the batch size is defined by the global setting `delete.query.batch.size`.

Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
How Has This Been Tested?
Events listing

In a test environment with 2 million event entries, I called `listEvents` multiple times passing different combinations of `account`, `archived`, `domainid`, `duration`, `enddate`, `entrytime`, `id`, `keyword`, `level`, `projectid`, `resourceid`, `resourcetype`, `startdate`, `startid` and `type`. For almost all the listings, I got a more reasonable response time (between 0.1 and 3 seconds depending on the filters) compared to before (30+ seconds). The only exception is the `keyword` parameter, which is not optimizable with an index because it performs a `column LIKE "%keyword%"` on 3 different columns.

Access control
I called `listEvents` and verified that:

Events deletion
- `delete.query.batch.size`, enabling batch delete;
- `deleteEvents` to remove all events before `2025-01-01`. Then, I verified that they were removed successfully;
- `event.purge.delay`, enabling the automatic deletion of events older than 15 days. After the deletion task executed, I verified that events older than 15 days were removed successfully.
I called `deleteEvents` and verified that: